 brier score


CalArena: A Large-Scale Post-Hoc Calibration Benchmark

arXiv.org Machine Learning

Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.
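The idea of scoring a post-hoc method by how much it improves a proper scoring rule can be sketched as follows. The abstract does not give the formula for PHI, so this sketch assumes a plain difference in Brier score before and after calibration, which may differ from the benchmark's definition. The predictions, labels, and the fixed temperature of 3.0 are hypothetical; in practice the temperature would be fitted on a separate calibration set.

```python
import math

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def temperature_scale(probs, temperature):
    """Rescale binary probabilities by dividing their logits by a temperature."""
    scaled = []
    for p in probs:
        logit = math.log(p / (1.0 - p))
        scaled.append(1.0 / (1.0 + math.exp(-logit / temperature)))
    return scaled

# Hypothetical overconfident predictions on a held-out set.
raw_probs = [0.99, 0.98, 0.97, 0.95, 0.02, 0.03]
labels = [1, 1, 0, 1, 0, 1]

# Assumption: the temperature is fixed here instead of being fitted.
calibrated = temperature_scale(raw_probs, temperature=3.0)

score_before = brier_score(raw_probs, labels)
score_after = brier_score(calibrated, labels)

# Assumed stand-in for a post-hoc improvement: positive means calibration helped.
improvement = score_before - score_after
print(f"before={score_before:.4f} after={score_after:.4f} gain={improvement:.4f}")
```

Because the Brier score is a proper scoring rule, a calibration map that degrades the model's predictive performance shows up as a negative improvement, so the comparison does not rely on a binned calibration-error estimate.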


Uncertainty-aware classification and triage of structural heart disease using electrocardiography and echocardiography metrics

arXiv.org Machine Learning

Machine learning methods provide a methodological innovation that can help screen for cardiovascular disease through noninvasive and readily available measurement modalities. Recent investments in using electrocardiogram (ECG) data to screen for structural heart disease (SHD) are one example, where ECGs provide a low-cost, available modality for screening. This has led to the EchoNext dataset, a paired ECG-echocardiogram data repository for testing new methods of SHD detection. However, relatively few studies have investigated how probabilistic classification through Bayesian inference may improve uncertainty quantification in this setting. Moreover, few studies have considered how triage systems can be developed to alleviate healthcare bottlenecks, such as the review of data from underserved, rural clinics by expert sonographers for SHD assessment. In this study, we leverage existing ECG-echocardiogram data to compare frequentist and Bayesian neural network classifiers. We show that the Bayesian approach is comparable to or better than frequentist methods in SHD classification, and that it provides more robust uncertainty quantification. We provide an example of how this uncertainty-aware classification scheme can be used for screening SHD, providing a proof of concept for how machine learning can help with triage by getting individuals expert sonographer input when SHD is highly likely or measurements are highly uncertain.
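A triage rule of the kind described, referring a case when SHD is highly likely or the prediction is highly uncertain, might look like the sketch below. The function name, the thresholds, and the use of the standard deviation of posterior samples as the uncertainty measure are all assumptions for illustration; the abstract does not specify the study's actual decision rule.

```python
import statistics

def triage(posterior_samples, prob_threshold=0.7, spread_threshold=0.15):
    """Route one patient from samples of the predicted SHD probability.

    Thresholds are hypothetical and would need clinical validation.
    """
    mean_prob = statistics.mean(posterior_samples)
    spread = statistics.pstdev(posterior_samples)  # assumed uncertainty measure
    if mean_prob >= prob_threshold:
        return "refer: SHD likely"
    if spread >= spread_threshold:
        return "refer: prediction uncertain"
    return "routine follow-up"

# Hypothetical posterior predictive samples for three patients.
confident_positive = [0.88, 0.91, 0.86, 0.90]
uncertain_case = [0.15, 0.65, 0.30, 0.70]
confident_negative = [0.05, 0.08, 0.06, 0.07]

for samples in (confident_positive, uncertain_case, confident_negative):
    print(triage(samples))
```

The second branch is what distinguishes this from a plain probability cutoff: a case with a moderate mean probability is still sent for expert review when the samples disagree widely.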







Incoherent Beliefs & Inconsistent Actions in Large Language Models

arXiv.org Artificial Intelligence

Real-world tasks and environments differ from the static datasets that large language models (LLMs) are typically evaluated on. Such tasks can involve sequential interaction, requiring coherent updating of beliefs in light of new evidence and appropriate decisions based on those beliefs. Predicting how LLMs will perform in such dynamic environments is important, but it is difficult to do from measurements in static settings. In this work, we examine two critical components of LLM performance: the ability of LLMs to coherently update their beliefs, and the extent to which the actions they take are consistent with those beliefs. First, we find that LLMs are largely inconsistent in how they update their beliefs; models can exhibit up to a 30% average difference between the directly elicited posterior and the correct update of their prior. Second, we find that LLMs also often take actions that are inconsistent with the beliefs they hold. On a betting market, for example, LLMs often do not even bet in the same direction as their internally held beliefs over the underlying outcomes. We also find they have moderate self-inconsistency in how they respond to challenges by users to given answers. Finally, we show that the above properties hold even for strong models that obtain high accuracy or that are well-calibrated on the tasks at hand. Our results highlight the difficulties of predicting LLM behavior in complex real-world settings.
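The comparison between a directly elicited posterior and the correct update of a prior can be illustrated with Bayes' rule for a binary hypothesis. All numbers below are hypothetical, and the absolute gap is only one plausible way to measure incoherence; the abstract does not state the exact metric the authors use.

```python
def bayes_posterior(prior, lik_if_true, lik_if_false):
    """Posterior probability of a binary hypothesis after one observation."""
    evidence = prior * lik_if_true + (1.0 - prior) * lik_if_false
    return prior * lik_if_true / evidence

# Hypothetical values elicited from a model.
prior = 0.30               # stated belief before seeing the evidence
lik_if_true = 0.80         # P(evidence | hypothesis true)
lik_if_false = 0.20        # P(evidence | hypothesis false)
elicited_posterior = 0.45  # belief the model reports after the evidence

coherent_posterior = bayes_posterior(prior, lik_if_true, lik_if_false)

# Assumed incoherence measure: absolute gap between the two posteriors.
gap = abs(elicited_posterior - coherent_posterior)
print(f"coherent={coherent_posterior:.3f} elicited={elicited_posterior:.3f} gap={gap:.3f}")
```

A model can be well calibrated on final answers and still show a large gap here, since the check concerns whether its stated prior, likelihoods, and posterior agree with one another.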


Frailty-Aware Transformer for Recurrent Survival Modeling of Driver Retention in Ride-Hailing Platforms

arXiv.org Artificial Intelligence

Ride-hailing platforms are characterized by high-frequency, behavior-driven environments, such as shared mobility platforms. Although survival analysis has been widely applied to recurrent events in other domains, its use for modeling ride-hailing driver behavior remains largely unexplored. To the best of our knowledge, this study is the first to formulate driver idle behavior as a recurrent survival process using large-scale platform data. This study proposes a survival analysis framework that uses a Transformer-based temporal encoder with causal masking to capture long-term temporal dependencies and embeds driver-specific embeddings to represent latent individual characteristics, significantly enhancing the personalized prediction of driver retention risk, modeling how historical idle sequences influence the current risk of leaving the platform via trip acceptance or log-off. The model is validated on datasets from the City of Toronto over the period January 2 to March 13, 2020. The results show that the proposed Frailty-Aware Cox Transformer (FACT) delivers the highest time-dependent C-indices and the lowest Brier Scores across early, median, and late follow-up, demonstrating its robustness in capturing evolving risk over a driver's lifecycle. This study enables operators to optimize retention strategies and helps policy makers assess shared mobility's role in equitable and integrated transportation systems. The purpose of this study is to model the driver retention behavior through a transformer-based survival model. Shared mobility services, such as ride-hailing, car-sharing, and bike-sharing, are becoming an increasingly prominent component of contemporary transportation systems. These services are central to the broader concept of Mobility as a Service (MaaS) [1], which aims to integrate various forms of transport into a unified and user-centric platform.
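For readers unfamiliar with Brier scores in a survival setting, the sketch below evaluates predicted survival probabilities at a single follow-up horizon. It makes a strong simplifying assumption that every event time is observed (no censoring); evaluations on real data typically correct for censoring with inverse-probability weights, and the abstract does not say which variant the authors use. The data values and the function name are hypothetical.

```python
def brier_at_horizon(pred_survival, event_times, horizon):
    """Brier score at one horizon, assuming no censored observations."""
    total = 0.0
    for s_hat, t in zip(pred_survival, event_times):
        # Observed status: 1.0 if the subject has not yet had the event.
        still_at_risk = 1.0 if t > horizon else 0.0
        total += (s_hat - still_at_risk) ** 2
    return total / len(event_times)

# Hypothetical predicted P(idle spell lasts beyond the horizon) per driver.
pred_survival = [0.9, 0.2, 0.6, 0.1]
# Hypothetical observed idle durations in minutes.
event_times = [12.0, 3.0, 8.0, 2.0]

score = brier_at_horizon(pred_survival, event_times, horizon=5.0)
print(f"Brier score at 5 minutes: {score:.3f}")
```

Repeating the calculation at early, median, and late horizons gives the kind of time-dependent comparison the abstract reports, with lower values indicating better predictions.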


Let the Experts Speak: Improving Survival Prediction & Calibration via Mixture-of-Experts Heads

arXiv.org Artificial Intelligence

Deep mixture-of-experts models have attracted a lot of attention for survival analysis problems, particularly for their ability to cluster similar patients together. In practice, grouping often comes at the expense of key metrics such as calibration error and predictive accuracy. This is due to the restrictive inductive bias that mixture-of-experts imposes: predictions for individual patients must look like predictions for the group they're assigned to. Might we be able to discover patient group structure, where it exists, while improving calibration and predictive accuracy? In this work, we introduce several discrete-time deep mixture-of-experts (MoE)-based architectures for survival analysis problems, one of which achieves all desiderata: clustering, calibration, and predictive accuracy. We show that a key differentiator between this array of MoEs is how expressive their experts are. We find that more expressive experts that tailor predictions per patient outperform experts that rely on fixed group prototypes.
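The restrictive inductive bias described above can be made concrete with a generic discrete-time sketch in which each expert holds a fixed hazard sequence (a group prototype) and a gate mixes the resulting survival curves. This is not the paper's architecture; the function names, hazards, and gate weights are hypothetical, and the standard discrete-time identity S(t) = prod_{k<=t} (1 - h_k) is the only modeling assumption.

```python
def survival_from_hazards(hazards):
    """Convert per-interval hazards h_k into S(t) = prod_{k<=t} (1 - h_k)."""
    curve, alive = [], 1.0
    for h in hazards:
        alive *= 1.0 - h
        curve.append(alive)
    return curve

def mixture_survival(gate_weights, expert_hazards):
    """Gate-weighted average of the experts' survival curves."""
    curves = [survival_from_hazards(h) for h in expert_hazards]
    n_steps = len(curves[0])
    return [
        sum(w * curve[t] for w, curve in zip(gate_weights, curves))
        for t in range(n_steps)
    ]

# Two hypothetical fixed prototypes: a low-risk and a high-risk group.
expert_hazards = [
    [0.05, 0.05, 0.05],
    [0.30, 0.30, 0.30],
]

# Hypothetical gate output for one patient, mostly assigned to the low-risk group.
patient_curve = mixture_survival([0.8, 0.2], expert_hazards)
print([round(s, 3) for s in patient_curve])
```

With fixed prototypes, every patient's curve is confined to blends of the same few group curves. Letting each expert produce hazards that depend on the patient's features removes that constraint, which is the sense in which the abstract calls experts more expressive.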